PROJECT 4

In [198]:
#Importing libraries
import pandas as pd

#from sklearn.linear_model import LinearRegression
#from sklearn.linear_model import LogisticRegression

# importing plotting libraries
import matplotlib.pyplot as plt   

#importing seaborn for statistical plots
import seaborn as sns

#Later we will split the X and y dataframes into a training set and a test set using
#sklearn's data-splitting function, which is based on random sampling

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

import numpy as np


# calculate accuracy measures and confusion matrix
from sklearn import metrics
In [199]:
bank_df = pd.read_csv("bank-full.csv")
In [200]:
#printing the top rows (left commented out)
#bank_df.head(50)
In [201]:
#size of data
bank_df.shape
Out[201]:
(45211, 17)
In [202]:
bank_df.dtypes
Out[202]:
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
Target       object
dtype: object

$\color{blue}{\text{Here age, day, duration, campaign, pdays and previous are of int type, so describe() will give values for these columns only.}}$
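The note above is right that `describe()` summarises only the numeric columns by default; a minimal sketch (on a toy frame, not the real `bank_df`) showing `include='object'` for the categorical ones:

```python
import pandas as pd

# Toy frame standing in for bank_df (illustrative values only)
df = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "management"],
})

# describe() alone summarises only the numeric columns...
numeric_summary = df.describe()

# ...but include='object' summarises the categorical ones:
# count, number of unique values, most frequent value and its frequency
object_summary = df.describe(include="object")
print(object_summary)
```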

In [203]:
#Exporting to excel file for analysis
bank_df.describe().transpose().to_excel("out.xlsx", sheet_name='bank_describe')
#Displaying the descriptive statistical values
bank_df.describe().transpose()
Out[203]:
count mean std min 25% 50% 75% max
age 45211.0 40.936210 10.618762 18.0 33.0 39.0 48.0 95.0
balance 45211.0 1362.272058 3044.765829 -8019.0 72.0 448.0 1428.0 102127.0
day 45211.0 15.806419 8.322476 1.0 8.0 16.0 21.0 31.0
duration 45211.0 258.163080 257.527812 0.0 103.0 180.0 319.0 4918.0
campaign 45211.0 2.763841 3.098021 1.0 1.0 2.0 3.0 63.0
pdays 45211.0 40.197828 100.128746 -1.0 -1.0 -1.0 -1.0 871.0
previous 45211.0 0.580323 2.303441 0.0 0.0 0.0 0.0 275.0
In [204]:
#checking the value campaign greater than 50
bank_df[bank_df['campaign'] > 50] # count greater than 50 contacted :4
###bank_df[bank_df['campaign'] < 5] #count less than 5 times contacted :39092
Out[204]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
4299 30 management single tertiary no 358 yes no unknown 19 may 88 51 -1 0 unknown no
4330 45 management married unknown no 9051 yes no unknown 19 may 124 63 -1 0 unknown no
5073 35 technician married secondary no 432 yes no unknown 21 may 1094 55 -1 0 unknown no
11914 24 technician single primary no 126 yes no unknown 20 jun 10 58 -1 0 unknown no
In [205]:
rowCount= 0
for row in bank_df[bank_df['pdays'] > 500]:
    rowCount = rowCount+1
print(str(rowCount))
    
17

Caution: iterating over a DataFrame yields its column labels, so this loop counted the 17 columns of the filtered frame, not its rows. `bank_df[bank_df['pdays'] > 500].shape[0]` would give the actual number of rows with pdays > 500.

$\color{red}{\text{Descriptive statistics results:}}$

| | mean | min | 25% | 50% | 75% | max | comments / domain understanding |
| --- | --- | --- | --- | --- | --- | --- | --- |
| age | 40.93621021 | 18 | 33 | 39 | 48 | 95 | mean slightly above the median, so the curve is right-skewed; the large gap between Q3 (48) and the max (95) makes an outlier likely, although 95 is still within a plausible age range |
| balance | 1362.272058 | -8019 | 72 | 448 | 1428 | 102127 | mean is roughly three times the median, so the curve is right-skewed; the huge gap between Q3 and the max points to outliers in this variable |
| day | 15.80641879 | 1 | 8 | 16 | 21 | 31 | whether the day last contacted has an impact needs checking |
| duration | 258.1630798 | 0 | 103 | 180 | 319 | 4918 | day, month and duration together describe the last contact |
| campaign | 2.763840658 | 1 | 1 | 2 | 3 | 63 | potential outliers: `bank_df[bank_df['campaign'] > 50]` returns 4 records, i.e. 4 people were contacted more than 50 times in this campaign; 39092 people were contacted fewer than 5 times |
| pdays | 40.19782796 | -1 | -1 | -1 | -1 | 871 | most values are -1; a negative "days since last contact" is a sentinel for clients never contacted before |
| previous | 0.580323373 | 0 | 0 | 0 | 0 | 275 | number of contacts with this client before this campaign; mostly 0, so we need to check whether 0 is genuine or also stands for missing info |
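The outlier remarks above can be made quantitative with the conventional 1.5×IQR whisker rule (my addition, not part of the original analysis); a self-contained sketch on toy balance values:

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside the conventional 1.5*IQR whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# Toy balances; -8019 and 102127 mimic the extremes seen in describe()
balances = pd.Series([72, 448, 1428, -8019, 102127, 300, 500])
print(iqr_outlier_count(balances))   # → 2 (the two extremes)
```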
In [206]:
#checking dtypes again before converting object columns to categorical codes
bank_df.dtypes
Out[206]:
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
Target       object
dtype: object
In [207]:
#Exporting to excel file for analysis
initial_bank_df = bank_df.copy() # saving the initial df before encoding; plain assignment would only alias bank_df, not copy it
bank_df.head(5).to_excel("out_top5_raw.xlsx", sheet_name='bankdf')#in a xlsx file, in a sheet
bank_df.head(5)
Out[207]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

Iteration 1: $\color{red}{\text{Working on categories }}$

In [208]:
#list unique job types
jobs = bank_df['job'].unique()
jobs
Out[208]:
array(['management', 'technician', 'entrepreneur', 'blue-collar',
       'unknown', 'retired', 'admin.', 'services', 'self-employed',
       'unemployed', 'housemaid', 'student'], dtype=object)
In [209]:
#counts of values occurring for each job in the dataset
bank_df['job'].value_counts()
Out[209]:
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64

$\color{red}{\text{Gathering knowledge on job types from a domain expert (in this case, Google :-P)}}$ White-collar work may be performed in an office or other administrative setting. ... Other types of work are those of a blue-collar worker, whose job requires manual labor, and a pink-collar worker, whose labor is related to customer interaction, entertainment, sales, or other service-oriented work.


  1. This means management and admin work happen within office limits, so both can be relabelled as white-collar: count of management + count of admin = 9458 + 5171 = 14629 ---> white-collar
  2. Likewise, the service-oriented jobs plus housemaid: 4154 + 1240 = 5394 ---> pink-collar
In [210]:
#replacing management and admin to white-collar
bank_df['job'] = bank_df['job'].replace(['management','admin.'],'white-collar')
#replacing service-oriented task(service) and housemaid to pink-collar
bank_df['job'] = bank_df['job'].replace(['services','housemaid'],'pink-collar')
bank_df['job'].value_counts()
Out[210]:
white-collar     14629
blue-collar       9732
technician        7597
pink-collar       5394
retired           2264
self-employed     1579
entrepreneur      1487
unemployed        1303
student            938
unknown            288
Name: job, dtype: int64

White-collar and pink-collar match the counts we expected. Unemployed, student, retired and unknown can each be treated as one small group, since they are less likely to match the expectation; unknown is doubtful and would need domain expertise, but since its count is low let us fold it into $\color{red}{\text{others}}$ as well.

In [211]:
bank_df['job'] = bank_df['job'].replace(['student','unemployed','retired','unknown'],'others')
bank_df['job'].value_counts()
Out[211]:
white-collar     14629
blue-collar       9732
technician        7597
pink-collar       5394
others            4793
self-employed     1579
entrepreneur      1487
Name: job, dtype: int64

More domain analysis (from Google): The Difference Between Entrepreneurs and the Self-Employed. ... Self-Employed - working for oneself as a freelancer or the owner of a business rather than for an employer. Entrepreneur - a person who organizes and operates a business or businesses, taking on greater than normal financial risks in order to do so.

$\color{red}{\text{No, we cannot merge self-employed and entrepreneur; this field is now as categorized as I can make it :-P}}$

Checking the next column, $\color{red}{\text{marital}}$

In [212]:
#already well categorized; no need to simplify further
bank_df['marital'].unique()
Out[212]:
array(['married', 'single', 'divorced'], dtype=object)
In [213]:
#checking next categorial independent variable education
bank_df['education'].unique()
#this is already categorized
#after applying category type and code primary:1 , secondary:2, tertiary:3 , unknown: 4 by alphabetical order
Out[213]:
array(['tertiary', 'secondary', 'unknown', 'primary'], dtype=object)
In [214]:
#checking next categorical variable default
bank_df['default'].unique()
Out[214]:
array(['no', 'yes'], dtype=object)

$\color{red}{\text{Imp Note:}}$ when we convert default (object type) to a categorical dtype, "no" becomes 0 and "yes" becomes 1 because of the alphabetical ordering of categories, which is fine in this case.
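The alphabetical-coding behaviour described in the note can be verified directly; a small sketch (on a toy series, not the real column):

```python
import pandas as pd

# Codes are assigned by the sorted (alphabetical) category order,
# so 'no' -> 0 and 'yes' -> 1, matching the note above.
col = pd.Series(["no", "yes", "no"])
cat = pd.Categorical(col)
print(list(cat.categories))   # ['no', 'yes']
print(list(cat.codes))        # [0, 1, 0]

# The code -> label mapping can be kept for later interpretation:
mapping = dict(enumerate(cat.categories))
print(mapping)                # {0: 'no', 1: 'yes'}
```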

In [215]:
#checking the next categorical variable housing
bank_df['housing'].unique() # ---> array(['yes', 'no'], dtype=object) --->again the above note
bank_df['loan'].unique()# ---> array(['yes', 'no'], dtype=object) --->again the above note
bank_df['contact'].unique() #---> array(['unknown', 'cellular', 'telephone'], 
                        #dtype=object) -- cellular:0 telephone:1 unknown:2 
                            #according to alphabetical order
bank_df['month'].unique() #--> if month has relevance from jan to dec we can convert 1 to 12
Out[215]:
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
       'mar', 'apr', 'sep'], dtype=object)
In [216]:
#day and month last contacted are not relevant 
#dropping the columns
bank_df.drop('month', axis=1, inplace=True)
bank_df.drop('day', axis=1, inplace=True)
In [217]:
#checking the next categorical variables 
bank_df['pdays'].unique()  # contains a value -1 
#lets count the values of -1 for pdays in the data 
bank_df[bank_df['pdays'] == -1] # 36954 rows × 15 columns
bank_df.pdays.dtype # column is dtype('int64')
#can't delete that many rows; instead replace the -1 sentinel with a larger value, say 10000
bank_df.loc[bank_df['pdays'] == -1,'pdays'] = 10000
# bank_df[bank_df['pdays'] == 10000] output : 36954 rows × 15 columns done successfully


#checking the next categorical variables
bank_df['previous'].unique()
bank_df['poutcome'].unique() # output :array(['unknown', 'failure', 'other', 'success'], dtype=object)
#unknown and other are neither success nor failure (a question for a domain expert); merging 'unknown' into 'other'
bank_df['poutcome'] = bank_df['poutcome'].replace(['unknown','other'],'other')
bank_df['Target'].unique() #array(['no', 'yes'], dtype=object) important note no : 0 yes : 1
Out[217]:
array(['no', 'yes'], dtype=object)
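An alternative to the 10000 sentinel (my own suggestion, not what this notebook does) is an explicit never-contacted flag, so a tree can split on "never contacted" directly instead of on an invented magnitude; the `never_contacted` column below is hypothetical:

```python
import pandas as pd

# Toy pdays column; -1 means "client was not previously contacted"
df = pd.DataFrame({"pdays": [-1, 30, -1, 871]})

# Keep a boolean flag for the sentinel, then neutralise the -1s
df["never_contacted"] = (df["pdays"] == -1).astype(int)
df.loc[df["pdays"] == -1, "pdays"] = 0
print(df)
```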
In [218]:
#converting object dtype to categorical replacing with values/codes
for feature in bank_df.columns:
    if bank_df[feature].dtype == 'object':
        bank_df[feature] =pd.Categorical(bank_df[feature]).codes
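One caveat with the conversion loop: once the codes replace the labels, the code-to-label mapping is lost. A sketch (on a toy frame; the `mappings` dict is my addition) that records each column's mapping during the same conversion:

```python
import pandas as pd

# Toy frame with one object column (stand-in for bank_df)
df = pd.DataFrame({"housing": ["yes", "no", "yes"], "age": [30, 40, 50]})

# Same conversion as the loop above, but remembering each column's
# code -> label mapping so the encoded values stay interpretable later
mappings = {}
for feature in df.columns:
    if df[feature].dtype == "object":
        cat = pd.Categorical(df[feature])
        mappings[feature] = dict(enumerate(cat.categories))
        df[feature] = cat.codes

print(df)
print(mappings)   # {'housing': {0: 'no', 1: 'yes'}}
```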
In [219]:
#Export to excel for analysis
bank_df.to_excel("out_data_cleaning.xlsx", sheet_name='bankdfCate')#same file, different sheet for comparison
bank_df.head(5)
Out[219]:
age job marital education default balance housing loan contact duration campaign pdays previous poutcome Target
0 58 6 1 2 0 2143 1 0 2 261 1 10000 0 1 0
1 44 5 2 1 0 29 1 0 2 151 1 10000 0 1 0
2 33 1 1 1 0 2 1 1 2 76 1 10000 0 1 0
3 47 0 1 3 0 1506 1 0 2 92 1 10000 0 1 0
4 33 2 2 3 0 1 0 0 2 198 1 10000 0 1 0
In [220]:
#checking for nulls
bank_df[bank_df.isnull().any(axis=1)].count()
Out[220]:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
Target       0
dtype: int64
In [221]:
bank_df.describe().transpose()
Out[221]:
count mean std min 25% 50% 75% max
age 45211.0 40.936210 10.618762 18.0 33.0 39.0 48.0 95.0
job 45211.0 3.524142 2.349361 0.0 2.0 4.0 6.0 6.0
marital 45211.0 1.167725 0.608230 0.0 1.0 1.0 2.0 2.0
education 45211.0 1.224813 0.747997 0.0 1.0 1.0 2.0 3.0
default 45211.0 0.018027 0.133049 0.0 0.0 0.0 0.0 1.0
balance 45211.0 1362.272058 3044.765829 -8019.0 72.0 448.0 1428.0 102127.0
housing 45211.0 0.555838 0.496878 0.0 0.0 1.0 1.0 1.0
loan 45211.0 0.160226 0.366820 0.0 0.0 0.0 0.0 1.0
contact 45211.0 0.640242 0.897951 0.0 0.0 0.0 2.0 2.0
duration 45211.0 258.163080 257.527812 0.0 103.0 180.0 319.0 4918.0
campaign 45211.0 2.763841 3.098021 1.0 1.0 2.0 3.0 63.0
pdays 45211.0 8214.689744 3777.244242 1.0 10000.0 10000.0 10000.0 10000.0
previous 45211.0 0.580323 2.303441 0.0 0.0 0.0 0.0 275.0
poutcome 45211.0 0.925018 0.369059 0.0 1.0 1.0 1.0 2.0
Target 45211.0 0.116985 0.321406 0.0 0.0 0.0 0.0 1.0
In [222]:
#Create a separate dataframe containing the independent variables (features)
bank_features_df = bank_df.drop(labels='Target', axis =1)
bank_features_df.head(5)
Out[222]:
age job marital education default balance housing loan contact duration campaign pdays previous poutcome
0 58 6 1 2 0 2143 1 0 2 261 1 10000 0 1
1 44 5 2 1 0 29 1 0 2 151 1 10000 0 1
2 33 1 1 1 0 2 1 1 2 76 1 10000 0 1
3 47 0 1 3 0 1506 1 0 2 92 1 10000 0 1
4 33 2 2 3 0 1 0 0 2 198 1 10000 0 1
In [223]:
sns.pairplot(bank_df, hue="Target")
Out[223]:
<seaborn.axisgrid.PairGrid at 0x1a8aee0048>
In [224]:
#the pair plot suggests a bell-like (though right-skewed) distribution for duration
#plotting an individual distribution for duration
sns.distplot(bank_df['duration'],bins=100)
Out[224]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a7a39f278>
In [225]:
#splitting the data into a training set and a test set
X = bank_features_df #features without the Target column
y = bank_df.pop("Target") #pop removes Target from bank_df in place and returns it
In [226]:
test_size = 0.30 # 70:30 split between training and test sets
seed = 1 # fixing the random seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
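Since only about 12% of the Target values are "yes", `stratify=y` would keep the class ratio identical in both splits (my suggestion, not what the cell above does); a self-contained sketch on toy imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and imbalanced labels (10% positive), mimicking Target
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

# stratify=y preserves the class ratio in both splits, which matters
# when the positive class is rare
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(y_tr.mean(), y_te.mean())   # 0.1 0.1
```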

Decision tree without Regularisation

In [230]:
dt_model = DecisionTreeClassifier(criterion= 'entropy')
dt_model.fit(X_train,y_train)
Out[230]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [231]:
#feature importances for the decision tree
print(pd.DataFrame(dt_model.feature_importances_,columns=["Imp"],index =X_train.columns))
                Imp
age        0.125446
job        0.042087
marital    0.018883
education  0.030137
default    0.000850
balance    0.166603
housing    0.024716
loan       0.013260
contact    0.042856
duration   0.342099
campaign   0.044000
pdays      0.043512
previous   0.018702
poutcome   0.086850

The most important feature is duration (LOL, but it can be true), followed by balance, which makes sense.

In [196]:
#Googled this method of printing the decision tree
#(note: sklearn.externals.six was removed in newer scikit-learn; use `from io import StringIO` there)

from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dt_model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.313042 to fit

Out[196]:
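On the X0/X1 node labels: `export_graphviz` accepts `feature_names` and `class_names`, so the rendered tree can show the real column names instead. A minimal self-contained sketch (toy data, not the bank dataset):

```python
from io import StringIO

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Tiny illustrative training set; the column names stand in for bank_df's
X = pd.DataFrame({"duration": [10, 300, 20, 400],
                  "balance": [5, 5, 5, 5]})
y = [0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# feature_names / class_names make the rendered nodes show the real
# column names and class labels instead of X0, X1, ...
dot = StringIO()
export_graphviz(tree, out_file=dot,
                feature_names=list(X.columns),
                class_names=["no", "yes"],
                filled=True, rounded=True)
print("duration" in dot.getvalue())   # True: the split label uses the column name
```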
In [233]:
#Predicting value
#calculating model score
#Printing the confusion matrix
y_pred = dt_model.predict(X_test)
model_score = dt_model.score(X_test, y_test)
print("Model Score")
print(model_score)
print("\n ===========")
print("Confusion Matrix")
print(metrics.confusion_matrix(y_test, y_pred))
Model Score
0.8629460336184016

 ===========
Confusion Matrix
[[11038   975]
 [  884   667]]
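With only ~12% of clients labelled "yes" (Target mean 0.117 in the describe output earlier), accuracy alone can mislead; per-class precision and recall complement the confusion matrix. A sketch on toy imbalanced labels (the numbers are illustrative, not from this dataset):

```python
import numpy as np
from sklearn import metrics

# Toy imbalanced labels: 90 negatives, 10 positives (illustrative only)
y_true = np.array([0] * 90 + [1] * 10)
# A classifier that finds 4 of the 10 positives and raises 5 false alarms
y_pred = np.concatenate([np.zeros(85), np.ones(5),
                         np.zeros(6), np.ones(4)]).astype(int)

# Accuracy looks decent, but recall on the positive class is only 0.4
print("accuracy:", metrics.accuracy_score(y_true, y_pred))
print(metrics.classification_report(y_true, y_pred, target_names=["no", "yes"]))
```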

Decision tree with Regularisation

In [238]:
dt_model1 = DecisionTreeClassifier(criterion= 'entropy', max_depth=5)
dt_model1.fit(X_train,y_train)
Out[238]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [236]:
#Googled this method of printing decision tree


from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dt_model1, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[236]:
In [242]:
#Predicting value
#calculating model score
#Printing the confusion matrix
y_pred = dt_model1.predict(X_test)

print("model score with training data X_test, y_test")

model_score = dt_model1.score(X_test, y_test)
print(model_score)

print("model score with test data X_test, y_pred")

model_score = dt_model1.score(X_test, y_pred)
print(model_score)


print("\n ===========")
print("Confusion Matrix")
print(metrics.confusion_matrix(y_test, y_pred))
model score with training data X_test, y_test
0.9012090828664111
model score with test data X_test, y_pred
1.0

 ===========
Confusion Matrix
[[11719   294]
 [ 1046   505]]

Random Forest Classifier

In [261]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(random_state =1,n_estimators = 6)
#random_state fixes the random number generation so the same score is produced every time
rfcl = rfcl.fit(X_train, y_train)
In [262]:
test_pred = rfcl.predict(X_test)
rfcl.score(X_test , y_test)
Out[262]:
0.894426422884105

Gradient boosting

In [265]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.09, max_depth=5)
gbcl = gbcl.fit(X_train, y_train)
In [266]:
test_pred = rfcl.predict(X_test)  # BUG: this cell was meant to evaluate gbcl
rfcl.score(X_test , y_test)       # so the score below is the Random Forest's again, not gradient boosting's
Out[266]:
0.894426422884105

$\color{red}{\text{Results :}}$

| Model | Score | Comment |
| --- | --- | --- |
| Decision tree without regularisation | 86% | the decision tree is huge |
| Decision tree with regularisation (max_depth=5) | 90% | the tree is small since depth is capped at 5 |
| Random Forest classifier (n_estimators=6) | 89% | approximately equal to the regularised decision tree |
| Gradient boosting (n_estimators=50, learning_rate=0.09, max_depth=5) | 89% | the scoring cell accidentally reused rfcl, so this number is the Random Forest score repeated |

$\color{blue}{\text{ Questions :}}$

  1. How can we check that the model will perform well in production?
  2. A few of the last-level decision tree nodes have entropy at or near 1, yet model accuracy is still 90%. Why?
  3. In my decision tree the independent variables appear as X0, X1, etc.; how do I set the actual labels?
  4. Are the columns I dropped based on my understanding really meant to be dropped, or could there be hidden patterns associated with them?
  5. Three models give the same score; is there any way to differentiate them?

  6. In the videos for decision trees:

    1. the score is calculated with the test set and predicted values for the decision tree
    2. the score is calculated with the test set and test labels for the random forest, not with predicted values
    3. the score is calculated with the test set and test labels for gradient boosting. Why?
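Regarding questions 1 and 5 above, one common approach (my suggestion, not from the course material) is k-fold cross-validation: it yields a mean score plus a spread, so models whose single test-set scores look identical can still be told apart. A sketch on synthetic imbalanced data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (~12% positives), illustrative only
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.88], random_state=1)

# 5-fold cross-validation: each fold is held out once, giving five
# scores whose mean and variance are more robust than one split
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=1)
scores = cross_val_score(model, X, y, cv=5)
print(scores.round(3), "mean:", scores.mean().round(3))
```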

Links referred:

  1. for printing the decision tree: this blog (click here)
  2. collar jobs: wiki link